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Comparing De Novo Transcriptome Assemblers 
Using Illumina RNA-Seq Reads 


ZHAO Lei, Zachary LARSON-RABIN, CHEN Si-Yun, GUO Zhen-Hua™ 
(Plant Germplasm and Genomics Center, Germplasm Bank of Wild Species, Kunming Institute of Botany , 
Chinese Academy of Sciences, Kunming 650201 , China) 


Abstract; In this study, we carried out a systematic comparison of the de novo transcriptome assembly performance 
of eight assemblers (ABySS, Velvet, SOAPdenovo, Oases, Trinity, Multiple-k, T-IDBA and Trans-ABySS) , pro- 
cessing Illumina RNA-Seq reads from F1 hybrids ( Drosophila MS) of Drosophila melanogaster and Drosophila sech- 
ellia and cultivated rice. Our study showed that Trinity and Trans-ABySS were the most effective for producing tran- 
scriptomes from our trial datasets using single k-mer and multiple k-mer methods, respectively, although the per- 
formance levels of the other tested assemblers were comparable. We found that using single k-mer assemblers gener- 
ally produced fewer total numbers of bases than multiple k-mer assemblers, although even the best assembler’s re- 
sults showed lower quality than some researchers may desire. Therefore, we developed and tested a novel de novo 
transcriptome assembly method, ETM, which employs a combination of multiple k-mer tools with Trinity assembler. 
The ETM method yielded superior results from our trial datasets. Our results will assist the growing number of tran- 
scriptome projects using Illumina RNA-Seq reads and provide guidelines for choosing appropriate software. 


Key words: High-throughput; NGS; Transcriptome; de novo assembly; Optimized 
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A transcriptome is the set of all RNA molecules 
present in a given cell or tissue at a given time, and 
the sequencing of a transcriptome is vital to fully un- 
derstand the genetics and biology of that cell, tissue, 
or organism. Although transcriptome sequencing and 
analysis have been possible for more than 20 years, 
recent technological advances have made the process 
much faster and cheaper. It has become feasible to 
quickly assemble transcriptomes and quantify tran- 
script abundance in species for which no genome se- 
quence has yet been constructed (i. e., de novo tran- 
scriptome assembly) , using so-called next-generation 
sequencing technologies ( NGS). There are currently 
three types of NGS platforms, including Roche 454 
FLX, Illumina, and Applied Biosystems SOLID. The 
Roche 454 platform generates reads of about 330 bp, 
and such relatively long read lengths are useful for de 
novo assembly since they improve the accuracy of 
mapping repetitive regions. Unfortunately, it is more 
expensive than the other two platforms and produces 
higher error rates in homopolymer repeats ( Metzker, 
2010; Shendure and Ji, 2008). The ABI SOLID plat- 
form is economical, but generates reads that are often 
too short to be well-suited for de novo assembly ( An- 
sorge, 2009). The Illumina system is also relatively 
inexpensive, but generates moderate read lengths 
(75-100 bp) that suffice for de novo assembly. 
Thus, it is the most widely used platform for NGS 
and has been recommended for de novo assembly 
(Fullwood et al., 2009; Simon et al., 2009). 

The “RNA-Seq” method of transcriptome ana- 
lysis makes use of the high numbers of short reads 
produced by the NGS platforms. RNA-Seq employs 
a “shotgun” -style approach both to sequence a 
transcriptome and to quantify the abundance of its 
transcripts, obviating the need for BAC and cDNA 
cloning ( Wang et al., 2009; Haas and Zody, 
2010). The RNA-Seq method starts with the high- 
throughput sequencing of double-stranded cDNA 
fragments, followed either by direct mapping of the 
sequences to a reference genome, or by their de novo 


assembly. This results in a comprehensive transcrip- 


tome analysis that includes transcripts’ untranslated 
regions ( UTRs) and alternative-splicing isoforms, 
as well as non-coding RNAs (ncRNAs). Since the 
abundance of each transcript is quantified, RNA-Seq 
can be used to assess the differential expression of 
genes in both physiological and pathological condi- 
tions, or to compare the transcriptomes of different 
tissue types, developmental stages, or genotypes 
(Nagalakshmi et al., 2008; Mortazavi et al., 2008). 

The use of a reference genome to reconstruct a 
transcriptome is termed ‘ genome-guided assembly’ , 
while ‘de novo assembly’ attempts transcriptome as- 
sembly without such aid ( Garber et al., 2011). Ge- 
nome-guided approaches, such as Cufflinks ( Trap- 
nell et al., 2010) and Scripture ( Guttman et al., 
2010) , first map all of the reads to a reference ge- 
nome, and then merge the overlapping reads into 
transcripts. However, the increasing interest in re- 
searching non-model organisms renders de novo as- 
sembly approaches increasingly important. In recent 
years, many programs have been developed specific- 
ally for de novo assembly of RNA-Seq reads genera- 
ted by Illumina (Table 1). 

In this study, we compared eight free de novo 
assemblers to assist the growing number of transcrip- 
tome projects using Illumina RNA-Seq reads (Table 
2): ABySS (Simpson et al., 2009; Birol et al., 
2009) , Velvet (Zerbino and Birney, 2008) , SOAP- 
denovo (Li et al., 2009), Oases (D. R. Zerbino, 
European Bioinformatics Institute ), Trinity ( Grab- 
herr et al., 2011), Multiple-k ( Surget-Groba and 
Montoya-Burgos, 2010 ), T-IDBA ( Peng et al., 
2011) and Trans-ABySS (Robertson et al. , 2010) , 
each of which is based on De Bruijn graph algorithms 
(Miller et al., 2010; Zerbino and Birney, 2008). 
Our study did not compare commercial software such 
as SeqMan (Santure et al., 2011), CLC ( Wickra- 
masinghe et al., 2011; Kumar and Blaxter, 2010) , 
or Rnnotator (Martin et al., 2010), although these 
programs can also process Illumina reads. We uti- 
lized the publicly available Illumina RNA-Seq reads 
from F1 hybrids (Drosophila MS) of D. melanogaster 
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Table 1 Software programs previously used for de novo transcriptome assembly using Illumina RNA-Seq reads 


Tool 


Organism 





ABySS 


Tumor ( Birol et al., 2009) ; Conus bullatus (Hu et al., 2011); Burmese pythons ( Wall et al., 2011) 


Chinese hamster ovary ( Birzele et al., 2010) ; Anopheles funestus (Crawford et al., 2010) ; Eucalyptus tree ( Mizrachi et al., 
Velvet 2010) ; Pachycladon enysii (Collins et al., 2008) ; Aedes aegypti and Anopheles gambiae (Gibbons et al. , 2009) ; Medicago 
sativa ( Yang et al., 2011) ; Potato (Hamilton et al., 2011) ; Ruditapes philippinarum (Ghiselli et al., 2011 ) 


Whitefly (Wang et al., 2010b) ; Sweetpotato ( Wang et al., 2010c) ; Nilaparvata lugens (Xue et al., 2010) ; Migratory locust 


(Chen et al., 2010) ; Lateolabrax japonicus ( Xiang et al., 2010) ; Camellia sinensis (Shi et al., 2011) ; Salvia miltiorrhiza 


( Wenping et al., 2011) ; Populus euphratica (Qiu et al., 2011) ; Dugesia japonica (Qin et al., 2011) ; Siraitia grosvenorii 


(Tang et al., 2011) ; Acacia auriculiformis and Acacia mangium (Wong et al., 2011); Taxus (Hao da et al., 2011) 


Chickpea (Garg et al., 2011) ; Sheep (Jager et al., 2011) ; Eichhornia paniculata (Ness et al., 2011) ; Radix balthica ( Feld- 


meyer et al., 2011) ; Xiphophorus maculatus (Garcia et al., 2011) ; Pandalus latirostris ( Kawahara-Miki et al., 2011) 


SOAPdenovo 
Oases 
Trinity Schizosaccharomyces pombe ( Grabherr et al., 2011) 
Multiple-k Loricaria (Surget-Groba and Montoya-Burgos , 2010 ) 
T-IDBA Embryonic stem cells (Peng et al., 2011) 


Trans-ABySS Liver ( Robertson et al., 2010) 


Table 2. Characteristics of de novo RNA-Seq assembly programs 





Assembler Strategy Cost k-mer parameter URL 
ABySS DBG Free Single k-mer www. begsc. ca/platform/bioinfo/ software/ abyss 
Velvet DBG Free Single k-mer www. ebi. ac. uk/ ~ zerbino/velvet/ 
SOA Pdenovo DBG Free Single k-mer http: //soap. genomics. org. cn/ soapdenovo. html 
Oases DBG Free Single k-mer www. ebi. ac. uk/ ~ zerbino/oases/ 
Trinity DBG Free Single k-mer http ; //trinityrnaseq. sourceforge. net 
Multiple-k DBG Free Multiple k-mer www. surget-groba. ch/download/stm. tar. gz 
T-IDBA DBG Free Multiple k-mer http; //i. es. hku. hk/ ~ alse/hkubrg/projects/tidba/ 
Trans-A BySS DBG Free Multiple k-mer www. begsc. ca/platform/ bioinfo/ software/ trans-abyss 


Abbreviations; DBG=De Bruijn graph; K-mer=a sequence “word” of k nucleotides in length 


and D. sechellia, and cultivated rice transcriptomes 
(McManus et al., 2010; Zhang et al., 2010) to ex- 
ploit the comprehensiveness of the genomic and pro- 
teomic annotation of these model species. Our goals 
were to gather and compare assembler performance 
data, including requirements for time and RAM 
(Random Access Memory) of the computers; the 
maximum, mean and median sizes of the tentative 
consensus transcript sequences (TCTSs) and their 
length distributions; as well as the N50 and N90 
summary statistics of each dataset. We also compared 
the assembly accuracy, integrity, coverage and mis- 
matches ratios resulting from each program, including 
the percentages of reads successfully mapped back to 
the assemblies, and the efficiency of de novo tran- 
scriptome assemblers in dealing with alternative spli- 
cing. We compiled the various factors and used these 


data formulate suggestions for researchers to consider 


when choosing appropriate de novo transcriptome as- 
semblers. Finally, we designed a new workflow for 


optimized assembly using multiple programs. 


Materials and methods 
Transcriptome reads and reference datasets 
RNA-Seq paired-end reads of the Drosophila MS 
and cultivated rice transcriptomes produced on the Il- 
lumina sequencing platform were obtained from the 
NCBI Gene Expression Omnibus (http://www. ncbi. 
nlm. nih. gov/geo) under accession numbers GSE20421 , 
GSE16631 and GSE16507 and used for de novo tran- 
scriptome assembly (McManus et al., 2010; Zhang 
et al., 2010). There were 4 019 544 876 (4G) 37- 
bp paired-end sequence reads, 5 982 700 350 (6G) 
75-bp paired-end sequence reads, with sequence 
depths of 25x and 15x, in the Drosophila MS and 


cultivated rice datasets, respectively. Genome, pro- 
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tein, and gene datasets for use as reference se- 
quences were obtained from FlyBase ( www. flybase. 
org) for D. melanogaster and from MSU Rice and 
RAP (http://rapdb. dna. affre. go. jp/download/in- 
dex. html) for Oryza sativa. 

Assembly parameters and extraction 

For each single k-mer assembler, different series 
of k-mer parameter were set (k=23, 25, 27, 29, 31) 
(see Additional File 1, Table S1, http://journal. 
kib. ac. cn/ UserFiles/File/2012-084_editing(1).rar). 
For each multiple k-mer assembler, incremental mul- 
tiple k-mer parameter combinations were assigned (k 
=(23, 25), (23, 25, 27), (23, 25, 27, 29), 
(23, 25, 27, 29, 31)) (see Additional File 1, Ta- 
ble S1, http://journal. kib. ac. cn/UserFiles/File/ 
2012-084 _ editing (1). rar). Each assembled TCTS 
( 2100 bp in length) from each program and dataset 
was extracted and analyzed by using the mathematical 
statistical method contained in our Perl script targe- 
ting TCTS length distributions (Tables 3 and 4; Ad- 
ditional File 2, http://journal. kib. ac. cn/UserFiles/ 
File/2012-084_editing(1).rar). The best assembly 
produced by each assembler was then chosen for com- 
parison. 

All programs were run on a supercomputer com- 
prising 16 Quad-Core AMD 2.2 GHz CPUs with 128 
GB memory. The X_86 64 bit operating system used 
SUSE Linux Enterprise Server 10. 


Mapping and alignment 

To obtain the percentage of reads utilized in 
each assembly, Bowtie was used to map all reads 
back to each transcriptome (Tables 3 and 4). All pa- 
rameters were set to -n 2 -l1 28 -e 70 (cf. Bowtie in- 
struction manual) (Langmead et al., 2009). To de- 
termine accuracy and integrity statistical estimates, 
the NCBI Standalone BLAST 2. 2.25 was used to a- 
lign each assembly to the D. melanogaster and O. sa- 
tiva protein sequence sets, using an E-value thresh- 
old of le-5 (Altschul et al., 1990; Mount, 2007). 
For the alignment of each query sequence, only the 
best-scoring hit was used, then mismatches rate ( to- 
tal mismatched bases/total de novo assembled bases ) 
was computed using Perl script (see Additional file 
6, http://journal. kib. ac. cn/UserFiles/File/2012- 
084_ editing (1). rar). Assembly accuracy (% of 
TCTSs hit) and integrity (% of proteins hit) ( Ta- 
bles 5 and 6) were then calculated using Perl script 
( Additional file 3, http://journal. kib. ac. cn/ 
UserFiles/File/2012-084_editing(1).rar). For the 
percentages of TCTSs achieving full coverage, the 
UCSC Standalone BLAT v. 34 (Kent, 2002) was 
used to align all TCTSs to the D. melanogaster and 
O. sativa gene sequences, and the maximum intron 
size was set to 3 000 bp (Deutsch and Long, 1999; 
Atambayeva et al., 2008). In the alignment of each 


query sequence, only the best-scoring hit was used, 


Table 3 Statistical characteristics for TCTSs assembled by each program using Drosophila MS RNA-Seq reads 





AB Ve SO Oa Tr Mk TI TA 
TCTS 2100 bp 69 477 60 112 66 407 46 929 21 662 126 212 90 202 164 798 
TCTS $200 bp 44 005 38 241 36 358 13 916 0 72 539 69 971 48 407 
TCTS21 kb 801 432 7218 17 358 4748 1 294 125 11 079 
Max 3 800 2 873 10 974 22 442 8519 2 873 2 734 4 326 
Mean 232 221 432 1065 773 243 174 425 
Median 160 163 177 604 556 178 137 318 
NSO 273 246 932 2 081 943 285 173 583 
N90 117 118 140 551 379 124 108 200 
% of reads mapped 28.9 22.3 31.3 56.6 78.6 46.3 27.5 64.3 
Multi-hit reads 25 0 0 26 399 8 557 360 0 0 0 6 902 154 


Abbreviations: AB; ABySS; Ve: Velvet; SO; SOAPdenovo; Oa: Oases; Tr: Trinity; Mk; Multiple-k; TI; T-IDBA; TA; Trans-ABySS. First row 


reports assembler names; Rows 2 to 9 report the TCTS size statistics, specifically; the number of TCTSs with size 2100 bp, the number of TCTSs 


with size <200 bp, and the number of TCTSs with size 21 kb. The remaining statistics represent the following TCTS characteristics; maximum sizes , 


mean sizes, median sizes, N50 and N90 sizes, the percentages of reads mapped back to the assemblies, and the numbers of multi-hit reads 
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and mismatches rates were then calculated using rar). The score of each alignment was computed u- 
Perl script (see Additional file 6, http://journal. sing the formula: score = matches-mismatches. The 
kib. ac. cn/UserFiles/File/2012-084 _ editing (1). same method was used when aligning gene models to 


Table 4 Statistical characteristics for TCTSs assembled by each program using cultivated rice RNA-Seq reads 





AB Ve SO Oa Tr Mk TI TA 
TCTS=100 bp 101 295 143 289 112 246 120 504 86 492 220 546 155 501 366 812 
TCTS <200 bp 47 354 86 649 42 312 31 453 0 129 214 91 962 144 556 
TCTS21 kb 6 264 1305 17 679 29 346 26 221 2 037 3 592 34 867 
Max 9 616 5 184 12 950 14 730 11 946 5 184 5 539 12 347 

Mean 358 230 562 735 914 235 255 437 

Median 213 173 270 387 626 178 172 249 

N50 518 254 1 038 1 400 1 247 261 304 695 

N90 151 123 207 293 410 125 119 177 

% of reads mapped 38.7 22.2 36.8 59.8 75.4 25.0 23.2 54.4 

Multi-hit reads >5 0 0 3 195 5 968 148 0 0 0 5 740 958 


Abbreviations: AB; ABySS; Ve: Velvet; SO; SOAPdenovo; Oa; Oases; Tr: Trinity; Mk; Multiple-k; TI; T-IDBA; TA; Trans-ABySS. First row 
reports assembler name; Rows 2 to 9 report the TCTS size statistics, specifically; the number of TCTSs with size 2100 bp, the number of TCTSs with 
size <200 bp, and the number of TCTSs with size 21 kb. The remaining statistics represent the following TCTS characteristics; maximum sizes, 
mean sizes, median sizes, N50 and N90 sizes, the percentages of reads mapped back to the assemblies, and the numbers of multi-hit reads mapped 


back to the assemblies 


Table 5 Accuracy, integrity and mismatches ratio of TCTSs obtained by each assembler using Drosophila MS RNA-Seq reads 





AB Ve SO Oa Tr Mk TI TA 
Accuracy; % of TCTSs hit to a known 75.8 75.3 75.4 76.5 92.5 78.6 78.7 81.6 
D. melanogaster protein sequence 
% of TCTSs showing 80% identity toa 72.9 74.5 73.3 73.8 91.8 78.0 77.8 80.8 
known D. melanogaster protein sequence 
Integrity ; A of D. melanogaster protein 84.6 83.3 85.2 87.1 76.9 86.5 83.1 86.1 
sequences with a TCTS hit 
% of D. melanogaster protein sequences 

70.7 69.6 71.2 73.6 8.3 73.6 69. 1 72.6 
showing 80% identity with a TCTS d i ? 
Mismatches ratio/% 0. 30 0.31 0.31 0.53 0.35 0.35 0.34 0.31 


Notes ; E-value cutoff=1e-5. To measure the accuracy of each assembly program, 23711 D. melanogaster proteins were compared to the TCTSs assem- 
bled by each program using BLASTX. To measure integrity, the same D. melanogaster protein sequences were compared to de novo assembled TCTSs 


using TBLASTN. Mismatches ratio is computed by total mismatched bases/total de novo assembled bases 


Table 6 Accuracy, integrity and mismatches ratio of TCTSs obtained by each assembler using cultivated rice RNA-Seq reads 





AB Ve SO Oa Tr Mk TI TA 
Accuracy ; f TCTSs hit to a k 
ee ee eee, 54.8 49.8 44.7 62.9 82.1 46.9 46.5 68.5 
O. sativa protein sequence 
% of TCTSs showing aye identity to 50.4 46.2 39.9 56.5 74.1 43.2 43.3 63.8 
a known O. sativa protein sequence 
Integrity; % of O. sativa protein 
1.7 81.4 4.5 84.0 87. 3.4 81.7 3.5 
sequences with a TCTS hit 8 6 8 8 
% of O. sativa protein sequences 
44. 44.3 46. 49.0 46. 47.0 44. 48. 
showing 80% identity with a TCTS 6 3 a $ ʻ 3.2 
Mismatches ratio/% 0. 65 0. 64 0.51 0.76 0.78 0.67 0.64 0.68 


Notes ; E-value cutoff=1le-5. To measure the accuracy of each assembly program, 67 393 O. sativa proteins were compared to the TCTSs assembled by 
each program using BLASTX for accuracy. To measure integrity, the same O. sativa protein sequences were compared to de novo assembled TCTSs u- 


sing TBLASTN. Mismatches ratio is computed by total mismatched bases/total de novo assembled bases 
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all TCTSs to determine the number of genes showing 
full coverage. Percentages of TCTSs and genes that a- 
chieved full and partial coverage were calculated 
(Tables 7 and 8) with Perl script (see Additional file 
4, http://journal. kib. ac. cn/UserFiles/File/2012- 
084_editing(1).rar). 
Analysis of alternative splice sites 

Current GTF annotations for D. melanogaster 
and O. sativa were downloaded from Ensemb|Ge- 
nomes (ftp://ftp. ensembl. org/pub/release-64/ gtf/ 
drosophila _ melanogaster/ Drosophila _ melanogaster. 
BDGP5S. 25. 64. gtf. gz and Oryza_sativa. MSU6. gtf. 
gz). De novo assembly transcripts were mapped to ge- 
nome sequences using GMAP ( http ://research-pub. 
gene. com/gmap/) (Wu and Watanabe, 2005) , show- 
ing only the single best scoring alignment, and splice 
levels were detected as recommended for GSNAP 
(GMAP manual). Novel and known splice sites were 


reported from each of the alignments (Fig.5). 


Results 
Time and RAM consumptions 


Each de novo RNA-Seq assembly program was 


initially scored for required RAM consumption and 
for the time needed to complete assembly calcula- 
tions. Both factors depend on the complexity of the 
dataset and the efficiency of the assembly algorithm. 
For both the Drosophila MS and the cultivated rice 
datasets, SOAPdenovo proved most efficient both in 
terms of processing speed and memory usage. In 
contrast, Trinity consumed the most computational 
resources, while the other assemblers displayed in- 
termediate requirements (Figures | and 2). Howev- 
er, it should be noted that analysis efficiency was 
not linked to assembly accuracy. 
Comparisons of TCTS contiguity produced by 
the eight programs 

The eight de novo assembly programs produced 
very different TCTS populations, and a variety of 
summary statistics and graphical representations have 
been used to describe their size characteristics, or 
contiguity. The assembly programs varied widely in 
the size distributions of the TCTSs themselves, while 
also varying between results for the Drosophila MS 
and cultivated rice datasets (Figures 3 and 4). For 
instance, the long and relatively thick tails of TCTS 


Table 7 Summaries of D. melanogaster gene coverage, the percentages of TCTSs covering 95% of TCTS 


matched gene sequences and mismatches ratio for each de novo assembly program 





AB Ve SO Oa Tr Mk TI TA 
% of TCTSs achieving TCTS coverage 31.2 36.9 18.5 40.0 72.9 42.4 18.4 55.8 
Number of genes achieving total gene coverage 31 9 127 1097 549 38 6 146 
Mismatches ratio/% 0.77 0.78 0.21 0.73 0.80 0.86 0. 80 0.85 


Notes; TCTS coverage is at least 95% of each TCTS length matched the gene, and is computed by match length/TCTS length; gene coverage is at 


least 80% of each gene length matched the TCTS, and is computed by match length/gene length. The assembly of each assembler was aligned to 


15 222 D. melanogaster gene sequences using BLAT for TCTS coverage. For gene coverage, the same D. melanogaster gene sequences were aligned 


to the assembly of each assembler using BLAT. Mismatches ratio is computed by total mismatched bases/total de novo assembled bases 


Table 8 Summaries of O. sativa gene coverage, the percentages of TCTSs covering 95% of TCTS 


matched gene sequences and mismatches ratio for each de novo assembly program 





AB Ve SO Oa Tr Mk TI TA 
% of TCTSs achieving TCTS coverage 37.6 37.2 20.3 40.0 56.2 36.4 29.7 36.7 
Number of genes achieving total gene coverage 127 36 625 1172 906 81 56 627 
Mismatches ratio/% 0.22 0.28 0.10 0.21 0.19 0.30 0.31 0.25 


Notes; TCTS coverage is at least 95% of each TCTS length matched the gene, and is computed by match length/TCTS length; gene coverage is at 


least 80% of each gene length matched the TCTS, and is computed by match length/gene length. The assembly of each assembler was aligned to 


56 797 O. sativa gene sequences using BLAT for TCTS coverage. For gene coverage, the same O. sativa gene sequences were aligned to the assem- 


bly of each assembler using BLAT. Mismatches ratio is computed by total mismatched bases/total de novo assembled bases 
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lengths produced by Oases and Trinity for both data- 
sets indicated a more desirable TCTS length distribu- 
tion (i. e., more long TCTSs). Trans-ABySS yiel- 
ded fewer and shorter TCTSs for Drosophila MS than 
while SOAPdenovo showed the 
opposite trend with fewer and shorter TCTSs for the 


for cultivated rice, 


Drosophila MS dataset. Other summary statistics in- 





























cluded measurements of the total numbers of TCTSs 
2100 bp, <200 bp, and = 
1 000 bp) as well as their maximum, mean, and 
median lengths (Tables 3 and 4). Trans-ABySS as- 
sembled the most total bases ( approximately 70 MB 
for Drosophila MS and 160 MB for cultivated rice) , 
although it also produced more of the shorter TCTSs 


(categorized by size: 
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Processing requirements in hours and 





transcriptome assembler using Drosophila MS RNA-Seq reads 
1. ABySS; 2. Velvet; 3. SOAPdenovo; 4. Oases; 5. Trinity; 
6. Multiple-k; 7. T-IDBA; 8. Trans-ABySS 


Fig. 2 Processing requirements in hours and RAM for each de novo 
transcriptome assembler using cultivated rice RNA-Seq reads 
1. ABySS; 2. Velvet; 3. SOAPdenovo; 4. Oases; 5. Trinity; 
6. Multiple-k; 7. T-IDBA; 8. Trans-ABySS 
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with nearly 47% and 58% of total TCTSs for both 
datasets (Figures 3 and 4). Trinity produced the fe- 
west total TCTSs and shorter TCTSs; 
TCTS lengths were 16 752 840 bp and 79 027 325 bp 
for Drosophila MS and cultivated rice, 


yet combined 


respectively , 
supporting the conclusion that the length of each 
TCTS was much longer. The largest and second lar- 
gest maximum TCTS sizes were produced by Oases 
and SOAPdenovo, respectively, for both Drosophila 
MS and cultivated rice (Tables 3 and 4). The lar- 
gest mean and median TCTS sizes were produced by 
Oases for Drosophila MS, but by Trinity for cultivat- 
ed rice, and the second-largest mean and median si- 
zes were produced by Trinity for Drosophila MS and 
Oases for cultivated rice (Tables 3 and 4). 

N50 and N90 have also been commonly de- 
scribed summary statistics for RNA-Seq transcrip- 
tome assembly, each defined as the TCTS length for 
which 50% or 90% of the TCTSs, respectively, are 


TCTS length/bp 


of the same length or larger ( Robertson et al., 
2010). They were determined by sorting all TCTSs 
by length from longest to shortest, and then selecting 
the length of the shortest TCTS such that the sum of 
all TCTSs of equal length or longer was at least 50% 
and 90% of the total length of all TCTSs. N50 and 
N90 estimate the connectivity of the assembly and 
can be used to compare estimated assembly quality , 
with higher values indicating better performance. 
For Drosophila MS, Oases produced the largest N50 
and N90 values at 2 081 bp and 551 bp, respective- 
ly, while Trinity came in a distant second with 943 
bp and 379 bp (Table 3). The same N50 trend was 
detected for cultivated rice transcriptome assembly , 
with Oases scoring best with 1 400 bp and Trinity 
scoring second-best with 1 247 bp (Table 4). The 
biggest N90 TCTS for the cultivated rice dataset, 
however, was produced by Trinity with 410 bp, while 
Oases was second biggest with 293 bp (Table 4). 
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Taken together, the contiguity-based summary 
statistics of these eight assemblers showed that Oa- 
ses, Trinity, Trans-ABySS and SOAPdenovo per- 
formed better than the others, although other factors 
besides TCTS contiguity should also be considered. 
Measured by total numbers of bases, the overall TCTS 
yields produced by the multiple k-mer assemblers 
( Multiple-k , T-IDBA and Trans-ABySS) ranged from 
20% to 60% larger than the yields from the single k- 
mer assemblers ( ABySS, Velvet, SOAPdenovo, Oa- 
ses, Trinity). This suggested that the assembly in- 
tegrity produced by the multiple k-mer assemblers 
was higher, i. e., these assemblers avoided more 
TCTS losses than the single k-mer assemblers. 
Mapping of reads 

Assembler quality depends on the maximal usage 
of reads to unambiguously map the TCTSs. There- 
fore, to compare all assemblers, each of the paired- 
end reads from the Drosophila MS and cultivated rice 
datasets was mapped separately back to each assem- 
bled transcriptome using Bowtie software (http :// 
bowtie-bio. sourceforge. net/index. shtml) ( Lang- 
mead et al., 2009). The Bowtie mapping algorithm 
was used to map short reads with a seed length of 
28, a maximum of two allowed mismatches in the 
seed, and the maximum permitted total quality val- 
ues of 70 at all mismatched read positions throughout 
the entire alignment (cf. Bowtie instruction manu- 
al). The assembly with the highest percentage of 
back-mapped reads was produced by Trinity for both 
datasets (78.6% and 75.4% in Drosophila MS and 
cultivated rice, respectively), followed by Oases 
(56.6% and 59.8%) and Trans-ABySS (64. 3% 
and 54.4% ) (Tables 3 and 4). The fewest reads 
were mapped to the Velvet assembled transcriptome , 
while ABySS, SOAPdenovo, Multiple-k and T-IDBA 
fared comparably well in terms of the ratio of mapped 
reads. The assemblies of Trinity, ABySS, Velvet, 
Multiple-k and T-IDBA showed no reads with multi- 
ple matches (25), suggesting that they generated 
the least redundancy of all assemblers. Conversely , 


Trans-ABySS, Oases and SOAPdenovo seemed to 


produce high assembly redundancy. 
Accuracy , integrity and coverage 

Transcriptome assemblies have been described 
in terms of their accuracy, integrity, and coverage 
with respect to the actual transcriptome being ana- 
lyzed. In this study, accuracy was measured by 
comparing each TCTS with all known coding se- 
quences from D. melanogaster or O. sativa genome 
closely related Drosophila MS and cultivated rice 
species, respectively, and then determining the per- 
centage of all TCTSs which overlapped with known 
coding sequences. Integrity was defined as the total 
percentage of all known proteins for which TCTSs 
were present, i. e., the degree of proteome represen- 
tation by the complete TCTS population. The optimal 
assembler in terms of transcriptome assembly quality 
would be the program that produces the highest com- 
bined accuracy and integrity of the transcriptome. 
After de novo assembling the transcriptomes from the 
Drosophila MS and cultivated rice reads, we used 
BLASTX (ftp ://ftp. ncbi. nlm. nih. gov/blast/execut- 
ables/release/2. 2. 25/blast-2. 2. 25-x64-linux. tar. gz ) 
to compare them to 23 711 D. melanogaster protein se- 
quences (ftp ://ftp. flybase. net/ genomes/ Drosophila_ 
melanogaster/dmel_ 15. 38 _ FB2011 _06/fasta/dmel- 
all-translation-15. 38. fasta. gz) and 67 393 O. sativa 
protein sequences (ftp ://ftp. plantbiology. msu. edu/ 
pub/data/Eukaryotic_Projects/o_sativa/ annotation _ 
dbs/pseudomolecules/version_6. 1/all. dir/all. pep ) 
to calculate assembler accuracy. To compute integri- 
ty, we used TBLASTN to compare the same D. mela- 
nogaster and O. sativa protein sequences to the de no- 
vo transcriptomes assembled by each assembler ( Alts- 
chul et al., 1990; Mount, 2007). For the alignment 
of each query sequence, only the smallest E value hit 
(cutoff=1e-5) was considered. Trinity performed best 
in terms of assembly accuracy (92.5% and 82. 1% 
for Drosophila MS and cultivated rice, respectively) , 
and Trans-ABySS (81.6% and 68.5%) and Oases 
(76.5% and 62.9% ) also fared well (Tables 5 and 6). 
Trinity also obtained the highest integrity (87.6% ) 


on the assembly of sequence reads from cultivated 
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rice, followed by SOAPdenovo (84. 5%), Oases 
(84.0% ) and Trans-ABySS (83.5% ) (Table 6). 
Assembly of the Drosophila MS reads led to different 
results for integrity, with Oases first at 87.1% and 
Multiple-k second at 86.4% , while Trinity was to- 
ward the bottom (76.9% ) (Table 5). 

For further analysis of the quality of de novo 
transcriptome assemblies produced by the eight test- 
ed programs, the TCTS coverage and gene coverage 
were calculated (Tables 7 and 8). The UCSC Blat 
(Kent, 2002) software ( http://genome-test. cse. 
ucsc. edu/ ~ kent/exe/linux/blatSuite. 34. zip ) was 
used to align all TCTSs to 15 222 D. melanogaster 
gene sequences ( ftp://ftp. flybase. net/genomes/ 
Drosophila_melanogaster/dmel_r5. 38 _FB2011_06/ 
fasta/dmel-all-gene-r5. 38. fasta. gz) and 56 797 O. 
sativa gene sequences (ftp://ftp. plantbiology. msu. 
edu/ pub/ data/Eukaryotic_Projects/o_sativa/annota- 
tion_dbs/pseudomolecules/version_6. 1/all. dir/all. 
seq) to determine TCTS coverage. Meanwhile, the 
same D. melanogaster and O. sativa gene sequences 
were aligned to the actual transcriptomes to deter- 
mine gene coverage. For the alignment of each query 
sequence, only the highest-scoring hit (see Materi- 
als and Methods) was used. Trinity showed the 
highest percentage of all TCTSs that covered at least 
95% of each TCTS (72.9% and 56.2% for Dro- 
sophila MS and cultivated rice, respectively) , fol- 
lowed by Trans-ABySS (55.8% and 36.7% ) and 
Oases (40.0% for both) (Tables 7 and 8). These 
results resembled those obtained by using BLASTX, 
supporting the conclusion that the de novo assembly 
of transcriptome obtained from Trinity possessed 
the highest accuracy. The program yielding TCTSs 
covering at least 80% of the most genes was Oases 
(1097 and 1 172), followed by Trinity (549 and 
906) , Trans-ABySS (146 and 627) and SOAPde- 
novo (127 and 625), indicating these programs 
have better capacity to reconstruct more full-length 
transcripts (Tables 7 and 8). 

Mismatches ratios and alternative splicing 


Mismatches ratios were compared as another 


way to assess the performance of these de novo as- 
semblers. These data were obtained by aligning each 
de novo assembly transcriptome to the protein and 
gene databases and using BLASTX (Tables 5 and 6) 
and Blat (Tables 7 and 8) (cf. Materials and Meth- 
ods). Oases showed the highest mismatches rate 
(0.53% ) , and ABySS displayed the lowest mismat- 
ches rate (0.30% ) for the Drosophila MS transcrip- 
tome (Table 5), while Trinity showed the highest 
mismatches rate (0. 78%), and SOAPdenovo dis- 
played the lowest mismatches rate (0.51% ) for cul- 
tivated rice (Table 6). The mismatches rates pro- 
duced by the other assemblers were comparably low 
to the least effective for each organism. Additional- 
ly, when Blat aligner was used, Multiple-k showed 
the highest mismatches rate (0. 86% ) , and SOAPd- 
enovo displayed the lowest mismatches rate (0.21% ) 
for Drosophila MS (Table 7) , while T-IDBA showed 
the highest (0.31% ), and SOAPdenovo displayed 
the lowest mismatches rate (0.10% ) for cultivated 
rice (Table 8). 

Alternative splicing is the one of the biggest 
challenges in RNA-Seq de novo assembly ( Martin 
and Wang, 2011; Wang et al., 2010a). As such, 
we compared the efficiency of alternative-splicing 
resolution among these de novo assemblers. Each as- 
sembled transcriptome was mapped separately to D. 
melanogaster (ftp://ftp. flybase. net/genomes/ Dro- 
sophila_melanogaster/dmel_r5. 38_FB2011_06/fas- 
ta/dmel-all-chromosome-r5. 38. fasta. gz) and O. sa- 
tiva ( ftp://ftp. ensemblgenomes. org/pub/plants/ 
release-11/fasta/oryza _ sativa/dna/Oryza _ sativa. 
MSU6. dna. toplevel. fa. gz) genome sequences to 
detect novel and known splices using GMAP (Wu 
and Watanabe, 2005) software (cf. Materials and 
Methods). We compared the numbers of novel and 
annotated splice sites captured by each assembler 
(Fig. 5). In Drosophila MS, Trans-ABySS identi- 
fied the most known splices (73 320) , followed by 
Oases (44 810) and Multiple-k (34 430); then 
Trans-ABySS identified the largest number of novel 
splices (32 876) , followed by Oases (24 162) and 
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Multiple-k (18 645). In cultivated rice, Trans-ABySS 
still identified the largest number of known splices 
(265 228), followed by Oases (136 884) , Trinity 
(132 213); then Trans-ABySS still the largest 
number of novel splices (122 613) , followed Oases 
(77050) , and Trinity (75 170). In terms of process- 
ing alternatively spliced transcripts, Trans-ABySS, 
Oases and Trinity showed the best performance; oth- 
ers were closely comparable (Fig. 5). 
Combining assembler programs into a novel workflow 
Despite the burgeoning number of de novo 
RNA-Seq assemblers, transcriptome assembly re- 
mains challenging ( Wang et al., 2010a; Nagarajan 
and Pop, 2009; Ozsolak and Milos, 2011; Martin 
and Wang, 2011). Many complicating factors are 
regularly encountered, such as short read lengths, 
poor read quality, low sequencing depth, and alter- 
natively spliced transcripts ( Paszkiewicz and Stud- 
holme , 2010; Wall et al., 2009). As such, the us- 
er-defined single k-mer programs are unlikely to 
yield optimal performance for de novo transcriptome 
assembly, but even the multiple k-mer programs can 


fall short of satisfactory results (Garber et al., 2011; 





Zerbino, 2010). Therefore, a strategy employing a 
combination of multiple k-mer programs should be 
produced to increase assembly success ( Surget-Gro- 
ba and Montoya-Burgos, 2010; Robertson et al., 
2010; Martin et al., 2010). To take full advantage 
of the assembly capabilities of different multiple k- 
mer assemblers, we have designed the “ETM” 
method. It consists of three components, beginning 
with the correction of read errors, followed by the 
production of a series of k-mer Trinity assemblies , 
and concluding with a merging step and the removal 
of redundancy (Fig. 6). 

To maximize the quality of the short sequence 
reads, the ETM process begins by applying the ECH- 
O algorithm to correct read errors. ECHO uses a max- 
imum a posteriori model on read overlaps, assigning 
quality scores to the bases to avoid false overlaps 
(Kao et al., 2011). ETM then employs the Trinity 
assembler on the corrected reads, combining differ- 
ent series of k-mers. Finally, these incremental as- 
semblies are merged into the final assembly using 
Multiple-k’ s CD-HIT-EST function ( http://www. 
bioinformatics. org/cd-hit/ ) (Li and Godzik , 2006 ) 





Error Correction of Reads 





























Assembly 






Different k-mer Trinity 
Assemblies 














Merging 








Drosophila MS Cultivated rice 
e ABySS e ABySS 
e Velvet e Velvet 
70 7 © SOAPdenovo 2504° SOAPdenovo 
© Oases © Oases 
© Trinity © Trinity 
© Multiple-k © Multiple-k 
e T-IDBA e T-IDBA 
60 4 * Trans-ABySS Trans-ABySS 
g FZ 2004 
5 5 
2 ‘ PA 
= = 1504 
2 404 2 . 
e 
ő a 
= . {=i 
M M 
304 100 4 
204 
b e 
° 504° 
10 15 20 25 30 20 40 60 80 100 120 


Novel splices (thousands) 


Fig.5 The results of the alternative 


splicing analysis 


Novel splices (thousands) 


Fig.6 Summary of the ETM 











assembly workflow 


498 E yrk REFR 


E 34 & 








to remove redundancy and retain the best possible 
assembly (see Additional file 5, http://journal. kib. 
ac. cn/ UserFiles/ File/2012-084_editing(1).rar). In 
this study, four ETM assemblies from each of the 
two datasets were produced using the incremental 
Trinity assemblies as input (Table S1, http://jour- 
nal. kib. ac. cn/UserFiles/ File/2012-084_editing(1). 
rar), and then the best of the four assemblies was 
chosen. We found that our ETM assemblies showed 
better TCTS summary statistics than those resulting 
from any of the eight assembler programs, including 
higher values for total bases, N50, N90, and mean 
and median lengths, indicating a substantial im- 
provement in contiguity. Surprisingly, the multi-hit 
reads ( 25) remained absent from the ETM results, 
indicating that little redundancy was produced by 
this method. Accuracy, integrity, coverage and mis- 
matches rates were also superior in the ETM products 
(Table 9). These results suggested that our strategy 
of combining multiple k-mer programs in a serial 
workflow was advantageous for de novo transcriptome 


assembly on non-model organisms. 


Discussion and conclusions 

We compared the de novo assembly performance 
of eight assembly programs, using two datasets pro- 
duced by the Illumina sequencing platform. For 
each de novo assembly program, we varied the k-mer 
parameters (see Additional File 1, Table S1, ht- 
tp://journal. kib. ac. en/UserFiles/File/2012-084 _ 
editing(1).rar) and chose the best assembly results 
for the comparisons. Trinity showed the best per- 
formance of the single k-mer tools, followed by Oa- 
ses and SOAPdenovo. Trans-ABySS performed best 
of the multiple k-mer programs, although the others 
were comparably effective. Additionally, we designed 
the ETM method, and found it to be well-suited for 
de novo transcriptome assembly. 

The quality of de novo assembly transcriptome 
can be estimated by using the information of closely 
related species (Kumar and Blaxter, 2010; Wang et 
al., 2010c; Ness et al., 2011; Feldmeyer et al., 


Table 9 Analysis of ETM assemblies for Drosophila MS 


and cultivated rice transcriptomes 


Drosophila MS 


Cultivated rice 











TCTS=100 bp 45 424 138 637 
TCTS $200 bp 0 0 
TCTS21 kb 10 723 46 276 
Max 8 519 13 038 
Mean 796 969 
Median 588 670 
N50 972 1 344 
N90 393 429 
% of reads mapped 80.2 77.0 
Multi-hit reads ( >5) 0 0 
% of TCTSs hit 93.5 84.5 
% o "TOTS; sharing 807 92.9 74.7 
identity to a known protein 
% of proteins hit 78.2 89.2 
% of proteins showing 80% 

58. 57.5 
identity with a TCTS 2 
U aoia ratio ( % ) 0.34 0.77 
using BLASTX 
% of TCTSs with full TCTS 73.9 58.7 
coverage 
Number of genes with full 571 1173 
gene coverage 
AU a ratio (% ) 0.84 0.18 
Using Blat 
Novel Splices 21 427 39 893 
Know Splices 123 888 216 506 


Notes ; E-value cutoff =1e-5. TCTS coverage represents the percentage 
by which TCTS length exactly matched the gene coding sequence, 
computed by match length/TCTS length; gene coverage is at least 
80% of each gene length matched the TCTS, and is computed by 
match length/ gene length. Mismatches ratio is computed by total mis- 


matched bases/total de novo assembled bases 


2011). To evaluate and compare the performance of 
the programs as well as our ETM workflow, we took 
advantage of D. melanogaster or O. sativa for refer- 
ence. Due to no true genomes for Drosophila MS and 
cultivated rice species, and exist the divergence of 
sequence and structure to D. melanogaster and O. sa- 
tiva from the aspect of evolution, so, genome-guided 
assemblers ( Cufflinks and Scripture ) were not used 
for getting the TCTSs of Drosophila MS and cultivated 
rice; otherwise, if done, wrong transcripts may be 
produced, then the assessment may also not be accu- 
rate. Therefore, genome-guided assemblers ( Cuf- 


flinks and Scripture) were not compared with de novo 
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assemblers. Since this study was designed to compare 
de novo transcriptome assembly programs using mR- 
NA sequencing reads, D. melanogaster and O. sativa 
protein and gene datasets can provide excellent refer- 
ence sequences. To support this conclusion, each 
transcriptome de novo assembled was also aligned to 
D. melanogaster and O. sativa genome sequences, and 
the results were shown (see Additional File 1, Tables 
52-55, http://journal. kib. ac. cn/UserFiles/File/ 
2012-084_editing(1).rar). As we known, sequence 
depth has various effects on the accuracy of de novo 
assembly (Garber et al., 2011; Martin and Wang, 
2011). The main purposes of this study were to com- 
pare the efficiency of each de novo assembler based 
the same trial datasets, so sequence depth was not 
deeply explored. Generally speaking, sequence depth 
should achieve at least 30 x to obtain a high quality 
de novo transcriptome assembly ( Martin and Wang, 
2011). When the reads were mapped back to the as- 
sembly, the threshold value for multiple hit reads 
( 25) allowed estimation of the possible redundancy 
of the assemblies, since many sequence reads likely 
derived from transcripts of gene family members. 
Here, we performed a systematic comparison 
study to evaluate de novo assembly programs accord- 
ing to a series of summary statistics, including conti- 
guity, accuracy, integrity, and coverage and mis- 
matches rate. Using our results, researchers may 
choose the appropriate assembler based on their spe- 
cific experimental data and the computational re- 
sources available. Many factors may affect de novo 
transcriptome assembly, however, such as the pres- 
ence of gene duplications and alternatively spliced 
transcripts. To accommodate the variability in tran- 
script abundance intrinsic to gene expression, and to 
improve on assembly quality, we recommend a serial 
multiple k-mer strategy such as our ETM method to 
obtain maximal transcriptome coverage. Such a strat- 
egy should also be considered when designing future 
assembly programs, since our ETM method performed 
better transcriptome assembly than any single program 


currently available. In addition, each NGS platform 


has certain advantages and disadvantages, and assem- 
bly performance could be improved by combining the 
best complementary advantages of the different NGS 
technologies and de novo assembly tools ( Adamidi et 
al., 2011; Iorizzo et al., 2011; Su et al., 2011; 
Angeloni et al., 2011; Sandmann et al., 2011). 
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